In this Section, we analyze the data(airquality) (see ?airquality) consisting of air quality measurements in New York, from Maty to September 1973. The data frame contains with 154 observations on 6 variables.
[,1] Ozone numeric Ozone (ppb)[,2] Solar.R numeric Solar R (lang)[,3] Wind numeric Wind (mph)[,4] Temp numeric Temperature (degrees F)[,5] Month numeric Month (1--12)[,6] Day numeric Day of month (1--31)For more details ?airquality
data(airquality)
pairs(airquality)Scatterplot matrix
A simple scatterplot shows that some of the variables have a non-linear relationship.
Let us consider a linear model for the response variable Ozone
airq.lm <- lm(Ozone ~ Temp + Wind + Solar.R, data=airquality)
summary(airq.lm)##
## Call:
## lm(formula = Ozone ~ Temp + Wind + Solar.R, data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.485 -14.219 -3.551 10.097 95.619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -64.34208 23.05472 -2.791 0.00623 **
## Temp 1.65209 0.25353 6.516 2.42e-09 ***
## Wind -3.33359 0.65441 -5.094 1.52e-06 ***
## Solar.R 0.05982 0.02319 2.580 0.01124 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.18 on 107 degrees of freedom
## (42 observations deleted due to missingness)
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.5948
## F-statistic: 54.83 on 3 and 107 DF, p-value: < 2.2e-16
Let us plot the results
par(mfrow=c(1,3))
termplot(airq.lm,se=TRUE)
Plots of lm fit
We can also plot the residuals of the model in order to check the model hypothesis
par(mfrow=c(1,2))
plot(airq.lm,which=1:2)
Plots of lm residuals
The lack of normality in the residuals is due to the asymmetry of the response variable Ozone, we can apply a log transform to achieve asymmetry, i.e.:
par(mfrow=c(1,2))
hist(airquality$Ozone, main="Ozone")
hist(log(airquality$Ozone), main ="log(Ozone)")
histograms of Ozone and log(Ozone)
We can now fit a linear model of the \(\log(Ozone)\), does the model look more adecuate?
lairq.lm <- lm(log(Ozone)~ Temp + Wind + Solar.R, data=airquality)
summary(lairq.lm)
plot(lairq.lm)Fitting a linear model we can conclude that there might be some heterokedasticity not accounted by the model. However, the most possible cause is that the relationship between the response variable and the covariates are far from linear.
Let us fit a GAM model, firstly with a single variable, i.e,:
library(mgcv)## Loading required package: nlme
## This is mgcv 1.8-24. For overview type 'help("mgcv-package")'.
airq.gam1 <- gam(log(Ozone) ~ s(Wind,bs="ps",m=2,k=10),
method="REML", select=TRUE,data=airquality)
summary(airq.gam1)##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(Ozone) ~ s(Wind, bs = "ps", m = 2, k = 10)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.4185 0.0655 52.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Wind) 2.565 9 6.453 2.76e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.336 Deviance explained = 35%
## -REML = 128.69 Scale est. = 0.49769 n = 116
The results show that the smooth effect s(Wind) is significative (with an approximate \(p\)-value close to zero and edf \(2.56\)). The next Figure shows the smooth wind effect. The increment of the wind speed decreases the \(\log\) Ozone levels (dramatically until 10mph). Note that including the intercept, the smooth effects are centered at zero.
plot(airq.gam1,residuals=TRUE,scheme=1)
Estimated smooth effect of Wind on log(Ozone). Data points are the residuals
gam.check(airq.gam1)
Check plots by gam.check
##
## Method: REML Optimizer: outer newton
## full convergence after 8 iterations.
## Gradient range [-1.455961e-06,1.183099e-06]
## (score 128.6926 & scale 0.4976891).
## Hessian positive definite, eigenvalue range [0.4379927,57.51548].
## Model rank = 10 / 10
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(Wind) 9.00 2.56 0.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that, in model airq.gam1 we used k=10 knots, the variable Wind has \(31\) unique values, and hence 10 knots should be enough. We fit a model for the residuals of the fitted gam model
resids <- residuals(airq.gam1)
resids.gam <- gam(resids~s(Wind,k=20,m=2),method="REML",
select=TRUE,data=airq.gam1$model)
summary(resids.gam)##
## Family: gaussian
## Link function: identity
##
## Formula:
## resids ~ s(Wind, k = 20, m = 2)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.137e-15 6.477e-02 0 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Wind) 7.419e-05 19 0 1
##
## R-sq.(adj) = -5.66e-07 Deviance explained = 7.89e-06%
## -REML = 124.14 Scale est. = 0.48659 n = 116
The results show that there is no unexplained variability between the variable and the residuals. Hence, we can conclude that we do not need more knots. The next Figure supports this statement.
plot(resids.gam)
Wind effect vs the residuals of airq.gam1
airq.pred <- data.frame(Wind=seq(min(airquality$Wind),max(airquality$Wind)),
length.out=200)
p <- predict(airq.gam1, newdata = airq.pred, type="response", se.fit = TRUE)
plot(airq.pred$Wind,p$fit, xlab="Wind", ylab="log(Ozone)",
type="l", ylim=c(0,6))
lines(airq.pred$Wind,p$fit + 1.96 * p$se.fit, lty=2)
lines(airq.pred$Wind,p$fit - 1.96 * p$se.fit, lty=2)
points(airquality$Wind,log(airquality$Ozone),cex=.55,col="grey", pch=15)Predicted curve and CI's
Now we add the variable Temp:
airq.gam2 <- gam(log(Ozone)~s(Wind,bs="ps",m=2,k=10)+s(Temp,bs="ps",m=2,k=10),
method="REML",select=TRUE,data=airquality)
summary(airq.gam2)##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(Ozone) ~ s(Wind, bs = "ps", m = 2, k = 10) + s(Temp, bs = "ps",
## m = 2, k = 10)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.41852 0.04963 68.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Wind) 2.068 9 1.353 0.000981 ***
## s(Temp) 3.990 9 10.496 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.619 Deviance explained = 63.9%
## -REML = 101.64 Scale est. = 0.28568 n = 116
Hence, the Temp effect is significative. The next Figure show both smooth effects.
par(mfrow=c(1,2))
plot(airq.gam2,residuals=TRUE,scheme=1)
Estimated smooth effect of Wind and Temp on log(Ozone). Data points are the residuals
We can compare both models using the function anova for a \(F\)-test:
anova(airq.gam1,airq.gam2)## Analysis of Deviance Table
##
## Model 1: log(Ozone) ~ s(Wind, bs = "ps", m = 2, k = 10)
## Model 2: log(Ozone) ~ s(Wind, bs = "ps", m = 2, k = 10) + s(Temp, bs = "ps",
## m = 2, k = 10)
## Resid. Df Resid. Dev Df Deviance
## 1 111.16 55.958
## 2 105.98 31.123 5.1832 24.835
We can conclude that Temp variable is relevant. Note that, strictly, in this case both models are not nested, as the effective dimension of the variable Wind is different when the variable Temp is present. Hence, we use the AIC to confirm the results.
AIC(airq.gam1)## [1] 255.0315
AIC(airq.gam2)## [1] 195.6561
Now, we include the covariate Solar.R. Notice that this variables contains missing values (NA's).
sum(is.na(airquality$Solar.R))## [1] 7
In order to compare the previous model airq.gam2 and the new model that includes Solar.R we will use the same data, so we will omit the missing values and refit the models.
new.airquality <- na.omit(airquality)
airq.gam22=gam(log(Ozone)~s(Wind,bs="ps",m=2,k=10)+s(Temp,bs="ps",m=2,k=10),
method="REML",select=TRUE,data=new.airquality)
airq.gam3=gam(log(Ozone)~s(Wind,bs="ps",m=2,k=10)+s(Temp,bs="ps",m=2,k=10)+s(Solar.R,bs="ps",m=2,k=20),
method="REML",select=TRUE,data=new.airquality)
summary(airq.gam22)##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(Ozone) ~ s(Wind, bs = "ps", m = 2, k = 10) + s(Temp, bs = "ps",
## m = 2, k = 10)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.41593 0.05128 66.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Wind) 2.1879 9 1.919 1e-04 ***
## s(Temp) 0.9874 9 8.601 7.73e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.611 Deviance explained = 62.2%
## -REML = 95.215 Scale est. = 0.29194 n = 111
summary(airq.gam3)##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(Ozone) ~ s(Wind, bs = "ps", m = 2, k = 10) + s(Temp, bs = "ps",
## m = 2, k = 10) + s(Solar.R, bs = "ps", m = 2, k = 20)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.41593 0.04586 74.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Wind) 2.318 9 2.255 2.44e-05 ***
## s(Temp) 1.852 9 6.128 1.12e-12 ***
## s(Solar.R) 2.145 19 1.397 1.23e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.689 Deviance explained = 70.7%
## -REML = 86.106 Scale est. = 0.23342 n = 111
AIC(airq.gam22)## [1] 185.7769
AIC(airq.gam3)## [1] 166.0423
Is the Solar.R variable relevant?
The next Figure shows the estimated smooth effects for Wind, Temp and Solar.R covariables (partial residuals are also plotted).
par(mfrow=c(2,2))
plot(airq.gam3,residuals=TRUE)
Smooth effects of Wind, Temp and Solar.R
Data from an experiment to compare growth patterns of two genotypes of soybeans: Plant Introduction #416937 (P), an experimental strain, and Forrest (F), a commercial variety.
library(nlme)
library(lattice)
data(Soybean)
?SoybeanSoy <- Soybean
Soy$logweight <- log(Soy$weight)
xyplot(logweight ~ Time, Soy, groups = Plot, type = c('g','l','b')) # Spaghetti plotlogweight based on Variety and Yearg1 <- lme(logweight ~ Year + Variety + Time + I(Time^2),random = list(Plot = ~ 1 + Time), data = Soy, method = "REML")
# or g1 <- lme(logweight ~ Year + Variety + Time + I(Time^2),random = ~ 1 + Time|Plot, data = Soy, method = "REML")
summary(g1)## Linear mixed-effects model fit by REML
## Data: Soy
## AIC BIC logLik
## -17.19048 22.87305 18.59524
##
## Random effects:
## Formula: ~1 + Time | Plot
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 0.214970174 (Intr)
## Time 0.003411878 -0.81
## Residual 0.190853283
##
## Fixed effects: logweight ~ Year + Variety + Time + I(Time^2)
## Value Std.Error DF t-value p-value
## (Intercept) -4.996188 0.06724163 362 -74.30201 0.0000
## Year1989 -0.399367 0.05050960 44 -7.90675 0.0000
## Year1990 0.043706 0.05041932 44 0.86685 0.3907
## VarietyP 0.330378 0.04129027 44 8.00136 0.0000
## Time 0.201914 0.00230711 362 87.51813 0.0000
## I(Time^2) -0.001308 0.00002352 362 -55.61852 0.0000
## Correlation:
## (Intr) Yr1989 Yr1990 VartyP Time
## Year1989 -0.418
## Year1990 -0.382 0.488
## VarietyP -0.308 0.003 0.003
## Time -0.745 0.075 0.015 0.000
## I(Time^2) 0.634 -0.076 -0.010 -0.001 -0.957
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -3.17863140 -0.59460576 0.01326769 0.57401455 3.14914073
##
## Number of Observations: 412
## Number of Groups: 48
xyplot(fitted(g1) ~ Time, Soy, groups = Plot, type = c('g','l','b')) # Spaghetti plotlibrary(SemiPar)
data(pig.weights)
library(lattice)
xyplot(weight~num.weeks,data=pig.weights,groups=id.num,type=c("g","l"))head(pig.weights)## id.num num.weeks weight
## 1 1 1 24.0
## 2 1 2 32.0
## 3 1 3 39.0
## 4 1 4 42.5
## 5 1 5 48.0
## 6 1 6 54.5
lme or lmerThe dataset is a collection of data about some of the passengers, and the goal is to predict the survival (either 1 if the passenger survived or 0 if they did not) based on some features such as the class of service, the sex, the age etc. As you can see, we are going to use both categorical and continuous variables.
VARIABLE DESCRIPTIONS:
pclass Passenger Class
(1 = 1st; 2 = 2nd; 3 = 3rd)
survival Survival
(0 = No; 1 = Yes)
name Name
sex Sex
age Age
sibsp Number of Siblings/Spouses Aboard
parch Number of Parents/Children Aboard
ticket Ticket Number
fare Passenger Fare
cabin Cabin
embarked Port of Embarkation
(C = Cherbourg; Q = Queenstown; S = Southampton)
boat Lifeboat
body Body Identification Number
home_dest Home/Destination
Full description of data set
Download data here
Read train and test set
train <- read.csv('data/titanic_train.csv',header=TRUE,row.names=1)
test <- read.csv('data/titanic_test.csv',header=TRUE,row.names=1)pclass as exploratory variable. What is the interpretation of the fitted model?These data were recorded by a Spanish survey, as part of a multi-country survey of the abundance of mackerel eggs off the coast of north-western Europe, in 1992.
library(gamair)
library(sm)
library(mgcv)
library(fields)
library(maps)
data(mackerel)
data(mackp)
attach(mackerel)
Latitude=mack.lat
Longitude=-mack.long
# plot the egg densities against location
plot(Longitude,Latitude,cex=Density/150,col=2,asp=.85)
map("world",add=TRUE,fill=TRUE,col="lightgrey")Mackerel eggs abundance
Fit a gam for the spatial locations
m0<-gam(log(Density)~s(Longitude,Latitude,k=30))
# vis.gam(m0,plot.type="contour",color="terrain")
# map("world",add=TRUE,fill=TRUE,col="grey")
summary(m0)##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(Density) ~ s(Longitude, Latitude, k = 30)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.22409 0.06126 52.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Longitude,Latitude) 24.24 27.67 15.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.606 Deviance explained = 64.1%
## GCV = 1.1512 Scale est. = 1.0471 n = 279
Smooth spatial trend
Now we include depth, Temperature and Salinity
ldepth <- log(mack.depth) # logarithm scale
m1<-gam(log(Density)~s(Longitude,Latitude)+s(Temperature)+s(Salinity)+s(ldepth))
par(mfrow=c(2,2))
plot(m1,scheme=2)
Additive model gam fit
summary(m1)##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(Density) ~ s(Longitude, Latitude) + s(Temperature) + s(Salinity) +
## s(ldepth)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.22409 0.05581 57.77 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Longitude,Latitude) 20.117 24.731 4.459 1.24e-10 ***
## s(Temperature) 2.324 2.914 3.857 0.0147 *
## s(Salinity) 1.000 1.000 0.379 0.5388
## s(ldepth) 2.791 3.529 14.166 3.11e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.673 Deviance explained = 70.4%
## GCV = 0.96301 Scale est. = 0.86901 n = 279
Check residuals
gam.check(m1)
gam.check results
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 11 iterations.
## The RMS GCV score gradient at convergence was 1.094165e-07 .
## The Hessian was positive definite.
## Model rank = 57 / 57
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(Longitude,Latitude) 29.00 20.12 0.91 0.035 *
## s(Temperature) 9.00 2.32 1.03 0.665
## s(Salinity) 9.00 1.00 0.96 0.190
## s(ldepth) 9.00 2.79 1.03 0.715
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Remove Salinity
m2<-gam(log(Density)~s(Longitude,Latitude)+s(Temperature)+s(ldepth))
par(mfrow=c(2,2))
plot(m2,scheme=2,1)
Smooth term of m2 model
Mackerel data from a Spanish Survery These data were recorded by a Spanish survey, as part of a multi-country survey of the abundance of mackerel eggs off the coast of north-western Europe, in 1992. This data exhibit rather different features from the remainder of the survey. One of these features is that no eggs were detected at all at a substantial number of the sampling points. This is due to the smaller nets and the need to compensate by taking a larger number of smaller volume samples. The sampling locations are shown in the next Figure.
## The following objects are masked from mackerel:
##
## Density, Temperature
Sampling positions with presence and absence of eggs
We consider a logistic model for the presence of eggs using as the \(\log\) of the depth as covariate
library(mgcv)
logit.gam <- gam(Presence ~ s(ldepth),family=binomial)
plot(logit.gam)logistic regression estimate of the relationship between presence and log of depth.
logit1 <- gam(Presence~s(Longitude,Latitude)+
s(ldepth)+s(Temperature),family=binomial)
summary(logit1)##
## Family: binomial
## Link function: logit
##
## Formula:
## Presence ~ s(Longitude, Latitude) + s(ldepth) + s(Temperature)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.6816 0.7591 -4.85 1.23e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(Longitude,Latitude) 28.738 28.97 67.083 8.38e-05 ***
## s(ldepth) 1.000 1.00 3.227 0.0724 .
## s(Temperature) 2.839 3.63 11.264 0.0234 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.322 Deviance explained = 34.2%
## UBRE = -0.085903 Scale est. = 1 n = 417
logit2 <- gam(Presence~s(Longitude,Latitude)+
s(ldepth),family=binomial)
summary(logit2)##
## Family: binomial
## Link function: logit
##
## Formula:
## Presence ~ s(Longitude, Latitude) + s(ldepth)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.9601 0.8548 -4.633 3.6e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(Longitude,Latitude) 28.822 28.982 65.65 0.000112 ***
## s(ldepth) 6.326 7.544 10.20 0.279097
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.299 Deviance explained = 33.8%
## UBRE = -0.068872 Scale est. = 1 n = 417
logit3 <- gam(Presence~s(Longitude,Latitude)+
s(Temperature),family=binomial)
summary(logit3)##
## Family: binomial
## Link function: logit
##
## Formula:
## Presence ~ s(Longitude, Latitude) + s(Temperature)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.5245 0.7247 -4.863 1.15e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(Longitude,Latitude) 28.711 28.968 73.830 1.03e-05 ***
## s(Temperature) 2.596 3.339 8.205 0.0647 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.313 Deviance explained = 33.2%
## UBRE = -0.08136 Scale est. = 1 n = 417